The goal is to classify patients as Parkinson's or healthy using attributes extracted from their voice recordings
name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
NHR,HNR - Two measures of ratio of noise to tonal components in the voice
RPDE,D2 - Two nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation
status - Health status of the subject: (one) - Parkinson's, (zero) - healthy
Data-Parkinsons.csv
#Import all the necessary modules
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import random
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from IPython.display import Image
from sklearn import tree
from os import system
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
parkinson_df = pd.read_csv('Data-Parkinsons.csv')
parkinson_df.head(10)
# Preview the data; check whether any zeros are stand-ins for missing values
parkinson_df.shape
The data has 195 rows and 24 columns in total.
parkinson_df.dtypes
__status__ - the target classification attribute: (one) - Parkinson's, (zero) - healthy. The goal is to classify patients into these labels using the attributes from their voice recordings.
Except for name and status, all other attributes are numerical.
parkinson_df.describe()
parkinson_df.info()
There are 24 attributes in total: 22 are numerical features, one (status) is the required binary classification variable, and one (name) is an object-type column that is unique per recording.
parkinson_df.isna().sum()
The data doesn't contain any null values.
Encoding categorical values into numerical values is not required for this dataset, because all feature values are floating point. The name column is categorical, but it is not used for model prediction.
So there is no need to apply label encoding.
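For reference, if a categorical column did need encoding, scikit-learn's LabelEncoder could be used. A minimal sketch on a toy column (not part of this dataset's pipeline, since name is simply dropped):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column for illustration only
toy = pd.DataFrame({"category": ["low", "high", "medium", "low"]})
encoder = LabelEncoder()
# LabelEncoder assigns integer codes in sorted order: high=0, low=1, medium=2
toy["category_encoded"] = encoder.fit_transform(toy["category"])
print(toy)
```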
parkinson_df_updated = parkinson_df.drop(["name"],axis=1)
parkinson_df_updated.groupby("status").agg({'status': 'count'})
sns.countplot(x='status',data=parkinson_df_updated)
This represents the distribution of status variable in the given data
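The counts above can be summarized as an imbalance ratio. A sketch of the computation on a hypothetical status column (the 147/48 split is the one reported for the UCI Parkinsons dataset; treat it as an assumption here, and in the notebook this would be `parkinson_df_updated["status"]`):

```python
import pandas as pd

# Hypothetical status column: 147 Parkinson's (1) vs 48 healthy (0)
status = pd.Series([1] * 147 + [0] * 48)
counts = status.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(counts.to_dict(), round(imbalance_ratio, 2))
```

A ratio of roughly 3:1 suggests accuracy alone may be optimistic, which is why the classification reports later also show precision and recall.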
columns = list(parkinson_df_updated.drop(["status"],axis=1)) # all 22 features (a [0:-1] slice here would silently drop the last one)
parkinson_df_updated[columns].hist(bins=10, figsize=(15,30), layout=(8,3));
plt.figure(figsize=(50,50))
plt.subplots_adjust(hspace=0.5)
i = 1
for feature in parkinson_df_updated.drop(["status"],axis=1).columns: # Loop through all feature columns
    plt.subplot(8, 3, i)
    b = sns.boxplot(x=parkinson_df_updated[feature])
    b.set_xlabel(feature, fontsize=30)
    b.tick_params(labelsize=30)
    i = i + 1
__MDVP:Fo(Hz)__ - Seems Evenly Distributed and most data is in between 125 to 175 and doesn't have outliers
__MDVP:Fhi(Hz)__ - Seems Evenly Distributed and most data is in the range between 50 and 250 and has outliers
__MDVP:Flo(Hz)__ - Seems Right Skewed and most of the data in between 75 to 150 and has outliers
__MDVP:Jitter(%)__ - Seems Right Skewed and most of the data in between 0.005 to 0.01 and has outliers
__MDVP:Jitter(Abs)__ - Seems Right Skewed and has outliers
__MDVP:RAP__ - Seems Right Skewed and has outliers
__MDVP:PPQ__ - Seems Right Skewed and has outliers
__Jitter:DDP__ - Seems Right Skewed and has outliers
__MDVP:Shimmer__ - Seems Right Skewed and has outliers
__MDVP:Shimmer(dB)__ - Seems Right Skewed and has outliers
__Shimmer:APQ3__ - Seems Right Skewed and has outliers
__Shimmer:APQ5__ - Seems Right Skewed and has outliers
__MDVP:APQ__ - Seems Right Skewed and has outliers
__Shimmer:DDA__ - Seems Right Skewed and has outliers
__NHR__ - Seems Right Skewed and has a large number of outliers
__HNR__ - Seems Evenly Distributed with very few outliers
__RPDE__ - Seems Right Skewed but no outliers
__DFA__ - Seems Left Skewed but no outliers
__spread1__ - Seems Right Skewed and has outliers
__spread2__ - Seems Slightly Left Skewed and has outliers
__D2__ - Seems Evenly Distributed and has very few outliers
__PPE__ - Seems Evenly Distributed and has outliers
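The visual skewness judgments above can be cross-checked numerically with pandas' `skew()`, where a positive value indicates right skew and a negative value left skew. A sketch on synthetic stand-in data (not the actual features):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# An exponential sample is right-skewed; a normal sample is roughly symmetric
df = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1.0, size=1000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),
})
print(df.skew().round(2))
```

Running `parkinson_df_updated.drop("status", axis=1).skew()` the same way would confirm, for example, that NHR is strongly right-skewed while DFA is left-skewed.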
sns.pairplot(parkinson_df_updated, hue="status", palette="husl")
parkinson_df_updated.corr()
plt.figure(figsize = (50,50))
sns.heatmap(parkinson_df_updated.corr(), annot = True, linewidths=.5)
__From the above heatmap we can observe strong correlations among variables such as MDVP:PPQ, MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, Jitter:DDP, MDVP:Shimmer, etc.__
correlation_values=parkinson_df_updated.corr()['status']
correlation_values.abs().sort_values(ascending=False)
Above are the correlation values with respect to the status variable, sorted in descending order of absolute value.
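One way to act on these correlations is to keep only features whose absolute correlation with status exceeds a threshold. A hedged sketch on toy data (the 0.3 cutoff is an arbitrary illustration, not a recommendation for this dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
status = rng.integers(0, 2, size=200)
# Toy frame standing in for parkinson_df_updated
df = pd.DataFrame({
    "status": status,
    "informative": status + rng.normal(0, 0.5, size=200),  # correlated with status
    "noise": rng.normal(0, 1, size=200),                   # uncorrelated
})
corr = df.corr()["status"].drop("status")
selected = corr[corr.abs() > 0.3].index.tolist()
print(selected)
```

The notebook below keeps all 22 features instead, which is also reasonable given the small feature count.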
Y = parkinson_df_updated["status"]
X = parkinson_df_updated.drop(["status"],axis=1)
X_Train,X_Test,Y_Train,Y_Test = train_test_split(X, Y, test_size=0.3, random_state=1)
X_Train.head()
Y_Train.head()
X_Train.isna().sum()
Y_Train.isna().sum()
No null values found in either the dependent or the independent variables of the training set.
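Because the classes are imbalanced, a stratified split keeps the class ratio identical in train and test. A sketch of `train_test_split`'s `stratify` parameter on synthetic labels (an alternative to the plain split above, not a re-run of it):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 75 + [0] * 25)  # 3:1 imbalance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)
# Both splits preserve the 75/25 class ratio exactly
print(y_tr.mean(), y_te.mean())
```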
logmodel = LogisticRegression()
logmodel.fit(X_Train,Y_Train)
predict = logmodel.predict(X_Test)
predictProb = logmodel.predict_proba(X_Test)
acc = accuracy_score(Y_Test, predict)
print(acc)
# Confusion Matrix
cm = confusion_matrix(Y_Test, predict)
class_label = ["Healthy (0)", "Parkinson's (1)"] # confusion_matrix orders classes 0, 1 by default
df_cm = pd.DataFrame(cm, index = class_label, columns = class_label)
sns.heatmap(df_cm, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
print(classification_report(Y_Test, predict))
# Creating odd list of K for KNN
myList = list(range(1,20))
# Subsetting just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, myList))
# Empty list that will hold accuracy scores
ac_scores = []
# Compute test accuracy for k = 1, 3, 5, ..., 19
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_Train, Y_Train)
    # Predict the response
    Y_Pred = knn.predict(X_Test)
    # Evaluate accuracy
    scores = accuracy_score(Y_Test, Y_Pred)
    ac_scores.append(scores)
# Changing to misclassification error
MSE = [1 - x for x in ac_scores]
# Determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)
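Rather than hard-coding a k value in the final model, the selected `optimal_k` can feed into it directly. A self-contained sketch of the same sweep-then-reuse pattern on toy data (not the notebook's actual split):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data standing in for X_Train/X_Test
X, y = make_classification(n_samples=200, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Same odd-k sweep as above, then reuse the winner instead of hard-coding it
errors = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors[k] = 1 - accuracy_score(y_te, knn.predict(X_te))
optimal_k = min(errors, key=errors.get)
final_knn = KNeighborsClassifier(n_neighbors=optimal_k).fit(X_tr, y_tr)
print(optimal_k, accuracy_score(y_te, final_knn.predict(X_te)))
```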
knn = KNeighborsClassifier(n_neighbors= 3 , weights = 'uniform', metric = 'euclidean')
knn.fit(X_Train, Y_Train)
predicted = knn.predict(X_Test)
acc = accuracy_score(Y_Test, predicted)
print(acc)
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
# Confusion Matrix
cm1 = confusion_matrix(Y_Test, predicted)
class_label = ["Healthy (0)", "Parkinson's (1)"] # confusion_matrix orders classes 0, 1 by default
df_cm1 = pd.DataFrame(cm1, index = class_label, columns = class_label)
sns.heatmap(df_cm1, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
# Classification Report
print(classification_report(Y_Test, predicted))
# Model
naive_model = GaussianNB()
naive_model.fit(X_Train, Y_Train)
prediction = naive_model.predict(X_Test)
naive_model.score(X_Test,Y_Test)
# Confusion Matrix
cm2 = confusion_matrix(Y_Test, prediction)
class_label = ["Healthy (0)", "Parkinson's (1)"] # confusion_matrix orders classes 0, 1 by default
df_cm2 = pd.DataFrame(cm2, index = class_label, columns = class_label)
sns.heatmap(df_cm2, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
print(classification_report(Y_Test, prediction))
svc = SVC()
svc.fit(X_Train, Y_Train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_Train, Y_Train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_Test, Y_Test)))
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_Train)
X_test_scaled = scaler.transform(X_Test) # transform only: reuse the training-set scaling to avoid data leakage
svc = SVC()
svc.fit(X_train_scaled, Y_Train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train_scaled, Y_Train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test_scaled, Y_Test)))
svc = SVC(C=1000)
svc.fit(X_train_scaled, Y_Train)
print("Accuracy on training set: {:.3f}".format(
    svc.score(X_train_scaled, Y_Train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, Y_Test)))
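Instead of trying C values by hand, a grid search can tune C and gamma jointly on scaled data. A hedged sketch on toy data (the grid values are illustrative, not tuned for this dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=1)
X_scaled = MinMaxScaler().fit_transform(X)

# Illustrative grid; in practice widen or narrow it based on the results
param_grid = {"C": [1, 10, 100, 1000], "gamma": ["scale", 0.1, 1.0]}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_scaled, y)
print(grid.best_params_, round(grid.best_score_, 3))
```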
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_Train, Y_Train)
print(dTree.score(X_Train, Y_Train))
print(dTree.score(X_Test, Y_Test))
train_char_label = ['No', 'Yes']
Credit_Tree_File = open('decision_tree.dot','w')
dot_data = tree.export_graphviz(dTree, out_file=Credit_Tree_File, feature_names = list(X_Train), class_names = list(train_char_label))
Credit_Tree_File.close()
from os import system
retCode = system("dot -Tpng decision_tree.dot -o decision_tree.png")
if retCode > 0:
    print("system command returning error: " + str(retCode))
else:
    display(Image("decision_tree.png"))
dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=1)
dTreeR.fit(X_Train, Y_Train)
print(dTreeR.score(X_Train, Y_Train))
print(dTreeR.score(X_Test, Y_Test))
train_char_label = ['No', 'Yes']
Credit_Tree_FileR = open('decision_treeR.dot','w')
dot_data = tree.export_graphviz(dTreeR, out_file=Credit_Tree_FileR, feature_names = list(X_Train), class_names = list(train_char_label))
Credit_Tree_FileR.close()
#Works only if the "dot" command is available on your machine
retCode = system("dot -Tpng decision_treeR.dot -o decision_treeR.png")
if retCode > 0:
    print("system command returning error: " + str(retCode))
else:
    display(Image("decision_treeR.png"))
__Note__: No improvement was observed from regularizing with max_depth=3, so dTree (the unrestricted tree) is used for further analysis instead of dTreeR.
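max_depth=3 is only one regularization setting; a small sweep over depth and leaf size makes the comparison systematic. A hedged sketch on toy data (parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
# Illustrative grid over two common regularization knobs
param_grid = {"max_depth": [2, 3, 4, 5, None], "min_samples_leaf": [1, 3, 5]}
grid = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=1),
    param_grid, cv=5, scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Cross-validated tuning like this is more reliable than comparing a single train/test score pair, especially on 195 rows.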
# Importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature (also known as Gini importance)
print (pd.DataFrame(dTree.feature_importances_, columns = ["Imp"], index = X_Train.columns).sort_values(by="Imp",ascending=False))
print(dTree.score(X_Test , Y_Test))
y_predict = dTree.predict(X_Test)
cm=metrics.confusion_matrix(Y_Test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = ["No","Yes"], columns = ["No","Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
bgcl = BaggingClassifier(base_estimator=dTree, n_estimators=50,random_state=1)
#bgcl = BaggingClassifier(n_estimators=50,random_state=1)
bgcl = bgcl.fit(X_Train, Y_Train)
y_predict = bgcl.predict(X_Test)
print(bgcl.score(X_Test , Y_Test))
cm=metrics.confusion_matrix(Y_Test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = ["No","Yes"], columns = ["No","Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
abcl = AdaBoostClassifier(n_estimators=10, random_state=1)
#abcl = AdaBoostClassifier( n_estimators=50,random_state=1)
abcl = abcl.fit(X_Train, Y_Train)
y_predict = abcl.predict(X_Test)
print(abcl.score(X_Test , Y_Test))
cm=metrics.confusion_matrix(Y_Test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = ["No","Yes"], columns = ["No","Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
gbcl = GradientBoostingClassifier(n_estimators = 50,random_state=1)
gbcl = gbcl.fit(X_Train, Y_Train)
y_predict = gbcl.predict(X_Test)
print(gbcl.score(X_Test, Y_Test))
cm=metrics.confusion_matrix(Y_Test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = ["No","Yes"], columns = ["No","Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
rfcl = RandomForestClassifier(n_estimators = 50, random_state=1,max_features=6)
rfcl = rfcl.fit(X_Train, Y_Train)
y_predict = rfcl.predict(X_Test)
print(rfcl.score(X_Test, Y_Test))
cm=metrics.confusion_matrix(Y_Test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = ["No","Yes"], columns = ["No","Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
__Note__: For comparison I am reusing the model parameters tuned earlier, e.g. n_neighbors for KNN, C=1000 for SVC, and max_features=6 for the RandomForest classifier. I compare all the models on different data sets (train, test, and all data), in both scenarios: without feature scaling and with feature scaling.
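Since the classes are imbalanced, StratifiedKFold keeps the class ratio in every fold, which can make these comparisons more stable than plain KFold. A sketch of the same cross-validation loop with stratification, on toy data with two of the models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Toy imbalanced data standing in for X, Y
X, y = make_classification(n_samples=200, weights=[0.25, 0.75], random_state=1)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for name, model in [("LR", LogisticRegression(max_iter=1000)), ("NB", GaussianNB())]:
    scores = cross_val_score(model, X, y, cv=skf, scoring="accuracy")
    print("%s: %.3f (%.3f)" % (name, scores.mean(), scores.std()))
```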
models = []
models.append(('KNN', KNeighborsClassifier(n_neighbors= 3 , weights = 'uniform', metric = 'euclidean')))
models.append(('LR', LogisticRegression()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(C=1000)))
models.append(('DT', DecisionTreeClassifier()))
models.append(('BA', BaggingClassifier(base_estimator=dTree, n_estimators=50,random_state=1)))
models.append(('AB', AdaBoostClassifier(n_estimators=10, random_state=1)))
models.append(('GB', GradientBoostingClassifier(n_estimators = 50,random_state=1)))
models.append(('RF', RandomForestClassifier(n_estimators = 50, random_state=1,max_features=6)))
# Evaluate each model with training data
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=1)  # shuffle=True is required when random_state is set
    cv_results = model_selection.cross_val_score(model, X_Train, Y_Train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Models Comparison with Training Data')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# Evaluate each model with testing data
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
    cv_results = model_selection.cross_val_score(model, X_Test, Y_Test, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Models Comparison with Test Data')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# Evaluate each model with all data
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Models Comparison with All Data')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
pipelines=[]
pipelines.append(('S_KNN',Pipeline([('scaler',StandardScaler()),('KNN',KNeighborsClassifier(n_neighbors= 3 , weights = 'uniform', metric = 'euclidean'))])))
pipelines.append(('S_LR',Pipeline([('scaler',StandardScaler()),('LR',LogisticRegression())])))
pipelines.append(('S_NB',Pipeline([('scaler',StandardScaler()),('NB',GaussianNB())])))
pipelines.append(('S_SVM',Pipeline([('scaler',StandardScaler()),('SVM',SVC(C=1000))])))
pipelines.append(('S_DT',Pipeline([('scaler',StandardScaler()),('DT',DecisionTreeClassifier())])))
pipelines.append(('S_BA',Pipeline([('scaler',StandardScaler()),('BA',BaggingClassifier(base_estimator=dTree, n_estimators=50,random_state=1))])))
pipelines.append(('S_AB',Pipeline([('scaler',StandardScaler()),('AB',AdaBoostClassifier(n_estimators=10, random_state=1))])))
pipelines.append(('S_GB',Pipeline([('scaler',StandardScaler()),('GB',GradientBoostingClassifier(n_estimators = 50,random_state=1))])))
pipelines.append(('S_RF',Pipeline([('scaler',StandardScaler()),('RF',RandomForestClassifier(n_estimators = 50, random_state=1,max_features=6))])))
# Evaluate each model with training data
results = []
names = []
scoring = 'accuracy'
for name, model in pipelines:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
    cv_results = model_selection.cross_val_score(model, X_Train, Y_Train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Models Comparison with Training Data')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# Evaluate each model with testing data
results = []
names = []
scoring = 'accuracy'
for name, model in pipelines:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
    cv_results = model_selection.cross_val_score(model, X_Test, Y_Test, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Models Comparison with Test Data')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# Evaluate each model with all data
results = []
names = []
scoring = 'accuracy'
for name, model in pipelines:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# Boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Models Comparison with All Data')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()